We will create Tables for Correlations, and graphs for Correlations in R. As always, we will consistently use the Project Mosaic ecosystem of packages in R (mosaic, mosaicData and ggformula).
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr 1.1.4 ✔ readr 2.1.4
✔ forcats 1.0.0 ✔ stringr 1.5.1
✔ ggplot2 3.4.4 ✔ tibble 3.2.1
✔ lubridate 1.9.3 ✔ tidyr 1.3.0
✔ purrr 1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag() masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
```{r}#| label: setup#| message: truelibrary(mosaic) # package for stats, simulations, and basic plots```
Registered S3 method overwritten by 'mosaic':
method from
fortify.SpatialPolygonsDataFrame ggplot2
The 'mosaic' package masks several functions from core packages in order to add
additional features. The original behavior of these functions should not be affected by this.
Attaching package: 'mosaic'
The following object is masked from 'package:Matrix':
mean
The following objects are masked from 'package:dplyr':
count, do, tally
The following object is masked from 'package:purrr':
cross
The following object is masked from 'package:ggplot2':
stat
The following objects are masked from 'package:stats':
binom.test, cor, cor.test, cov, fivenum, IQR, median, prop.test,
quantile, sd, t.test, var
The following objects are masked from 'package:base':
max, mean, min, prod, range, sample, sum
```{r}#| label: setup#| message: truelibrary(ggformula) # package for professional looking plots, that use the formula interface from mosaiclibrary(skimr)```
Attaching package: 'skimr'
The following object is masked from 'package:mosaic':
n_missing
Registered S3 method overwritten by 'GGally':
method from
+.gg ggplot2
```{r}#| label: setup#| message: truelibrary(corrplot) # For Correlogram plots```
corrplot 0.92 loaded
```{r}#| label: setup#| message: truelibrary(broom) # to properly format stat test resultslibrary(mosaicData) # package containing datasetslibrary(NHANES) # survey data collected by the US National Center for Health Statistics (NCHS)```
All R functions seen in the code are clickable links that take you to online documentation about the function. Try!
The Formula interface
Note the standard method for all commands from the mosaic package:
goal( y ~ x | z, data = mydata, …)
With ggformula, one can create any graph/chart using:
gf_geometry(y ~ x | z, data = mydata)
OR
mydata %>% gf_geometry( y ~ x | z )
The second method may be preferable, especially if you have done some data manipulation first! More about this later!
Case Study 1: Galton Dataset from mosaicData
Let us inspect what datasets are available in the package mosaicData. Run this command in your Console:
```{r, eval=FALSE}# Run in Consoledata(package ="mosaicData")```
The popup tab shows a lot of datasets we could use. Let us continue to use the famous Galton dataset and inspect it:
```{r}data("Galton")```
Inspecting the Data
The inspect command already gives us a series of statistical measures of different variables of interest. As discussed previously, we can retain the output of inspect and use it in our reports: (there are ways of dressing up these tables too)
Try help("Galton") in your Console. The dataset is described as:
A data frame with 898 observations on the following variables.
- family a factor with levels for each family
- father the father’s height (in inches)
- mother the mother’s height (in inches)
- sex the child’s sex: F or M
- height the child’s height as an adult (in inches)
- nkids the number of adult children in the family, or, at least, the number whose heights Galton recorded.
There is a lot of Description generated by the mosaic::inspect() command ! Let us also look at the output of skim:
What can we say about the dataset and its variables? How big is the dataset? How many variables? What types are they, Quant or Qual? If they are Qual, what are the levels? Are they ordered levels? Which variables could have relationships with others? Why? Write down these Questions!
Correlations and Plots
What Questions might we have, that we could answer with a Statistical Measure, or Correlation chart?
Questions
How does children’s height correlate with that of father and mother? Is this relationship also affected by sex of the child?
With this question, height becomes our target variable, which we should always plot on the dependent y-axis.
```{r}#| label: ggpairs-1#| message: false# Pulling out the list of Quant variables from NHANESgalton_quant <- galton_describe$quantitativegalton_quant$nameGGally::ggpairs( Galton,# Choose the variables we want to plot forcolumns =c("father", "mother", "height", "nkids"),switch ="both", # axis labels in more traditional locationsprogress =FALSE, # no compute progress messages needed# Choose the diagonal graphs (always single variable! Think!)diag =list(continuous ="barDiag"), # choosing histogram,not density# Choose lower triangle graphs, two-variable graphslower =list(continuous =wrap("smooth", alpha =0.1)),title ="Galton Data Correlations Plot") +theme_bw()```
[1] "father" "mother" "height" "nkids"
We note that children’s height is correlated with that of father and mother. The correlations are both positive, and that with father seems to be the larger of the two. ( Look at the slopes of the lines and the values of the correlation scores. )
Question
What if we group the Quant variables based on a Qual variable, like sex of the child?
```{r}#| label: ggpairs-2 with groups#| message: false# Pulling out the list of Quant variables from NHANESgalton_quant <- galton_describe$quantitativegalton_quant$nameGGally::ggpairs( Galton,mapping =aes(colour = sex), # Colour by `sex`# Choose the variables we want to plot forcolumns =c("father", "mother", "height", "nkids"),switch ="both", # axis labels in more traditional locationsprogress =FALSE, # no compute progress messages neededdiag =list(continuous ="barDiag"),# Choose lower triangle graphs, two-variable graphslower =list(continuous =wrap("smooth", alpha =0.1)),title ="Galton Data Correlations Plot") +theme_bw()```
[1] "father" "mother" "height" "nkids"
The split scatter plots are useful, as is the split histogram for height: Clearly the correlation of children’s height with father and mother is positive for both sex-es. The other plots, and even some of the correlations scores are not all useful! Just shows everything we can compute is not necessarily useful immediately.
In later modules we will see how to plot correlations when the number of variables is larger still.
Clearly height is positively correlated to father and mother; interestingly, height is negatively correlated ( slightly) with nkids.
Question
Let us confirm with a correlation test:
We will use the mosaic function cor_test to get these results:
```{r}#| column: body-outset-rightmosaic::cor_test(height ~ father, data = Galton) %>% broom::tidy() %>% knitr::kable(digits =2,caption ="Children vs Fathers")```
Children vs Fathers
estimate
statistic
p.value
parameter
conf.low
conf.high
method
alternative
0.28
8.57
0
896
0.21
0.33
Pearson’s product-moment correlation
two.sided
```{r}#| column: body-outset-rightmosaic::cor_test(height ~ mother, data = Galton) %>% broom::tidy() %>% knitr::kable(digits =2,caption ="Children vs Mothers")```
Children vs Mothers
estimate
statistic
p.value
parameter
conf.low
conf.high
method
alternative
0.2
6.16
0
896
0.14
0.26
Pearson’s product-moment correlation
two.sided
Correlation Scores and Uncertainty
Note how the mosaic::cor_test() reports a correlation score estimate and the p-value for the same. There is also a confidence interval reported for the correlation score, an interval within which we are 95% sure that the true correlation value is to be found.
Note that GGally::ggpairs() too reports the significance of the correlation scores estimates using *** or **. This indicates the p-value in the scores obtained by GGally; Presumably, there is an internal cor_test that is run for each pair of variables and the p-value and confidence levels are also computed internally.
In both cases, we used the formula \(height \sim other-variable\), in keeping with our idea of height being the dependent, target variable..
We also see the p.value for the estimateed correlation is negligible, and the conf.low/conf.high interval does not straddle \(0\). These attest to the significance of the correlation score.
Question
What does this correlation look when split by sex of Child?
```{r}#| column: body-outset-right#| results: hold# For the sonsmosaic::cor_test(height ~ father,data = Galton %>%filter(sex =="M")) %>% broom::tidy() %>% knitr::kable(digits =2,caption ="Sons vs Fathers")cor_test(height ~ mother, data = Galton %>%filter(sex =="M")) %>% broom::tidy() %>% knitr::kable(digits =2,caption ="Sons vs Mothers")# For the daughterscor_test(height ~ father, data = Galton %>%filter(sex =="F")) %>% broom::tidy() %>% knitr::kable(digits =2,caption ="Daughters vs Fathers")cor_test(height ~ mother, data = Galton %>%filter(sex =="F")) %>% broom::tidy() %>% knitr::kable(digits =2,caption ="Daughters vs Mothers")```
Sons vs Fathers
estimate
statistic
p.value
parameter
conf.low
conf.high
method
alternative
0.39
9.15
0
463
0.31
0.47
Pearson’s product-moment correlation
two.sided
Sons vs Mothers
estimate
statistic
p.value
parameter
conf.low
conf.high
method
alternative
0.33
7.63
0
463
0.25
0.41
Pearson’s product-moment correlation
two.sided
Daughters vs Fathers
estimate
statistic
p.value
parameter
conf.low
conf.high
method
alternative
0.46
10.72
0
431
0.38
0.53
Pearson’s product-moment correlation
two.sided
Daughters vs Mothers
estimate
statistic
p.value
parameter
conf.low
conf.high
method
alternative
0.31
6.86
0
431
0.23
0.4
Pearson’s product-moment correlation
two.sided
The same observation as made above ( p.value and confidence intervals) applies here too and tells us that the estimated correlations are significant.
Visualizing Uncertainty in Correlation Estimates
We can also visualize this uncertainty and the confidence levels in a plot too, using gf_errorbar and a handy set of functions within purrr which is part of the tidyverse. Assuming heights is the target variable we want to correlate every other (quantitative) variable against, we can proceed very quickly as follows: we will first plot correlation uncertainty for one pair of variables to develop the intuition, and then for all variables against the one target variable:
```{r}#| label: Single Correlation Uncertaintymosaic::cor_test(height ~ mother, data = Galton) %>% broom::tidy() %>%# We need a graph not a table # So comment out this line from the earlier code#knitr::kable(digits = 2,caption = "Children vs Mothers")rowid_to_column(var ="index") %>%# Need an index to plot with# Uncertainty as error-barsgf_errorbar(conf.high + conf.low ~ index, linewidth =2) %>%# Estimate as a pointgf_point(estimate ~ index, color ="red", size =6) %>%# Labelsgf_text(estimate ~ index -0.2, label ="Correlation Score = estimate") %>%gf_text(conf.high*0.98~ index -0.25, label ="Upper Limit = estimate + conf.high") %>%gf_text(conf.low*1.04~ index -0.25, label ="Lower Limit = estimate - conf.low") %>%gf_theme(theme_bw())```
We can now do this for all variables against the target variable height, which we identified in our research question. We will use the iteration capabilities offered by the tidyverse package, purrr:
```{r}#| label: corrtest plotsall_corrs <- Galton %>%select(where(is.numeric)) %>%# leave off height to get all the remaining onesselect(- height) %>%# perform a cor.test for all variables against height purrr::map(.x = .,.f = \(x) cor.test(x, Galton$height)) %>%# tidy up the cor.test outputs into a tidy data framemap_dfr(broom::tidy, .id ="predictor") all_corrsall_corrs %>%# arrange the predictors in order of their correlation scores# with the target variable (`height`)# Add errorbars to show uncertainty ranges / confidence intervals# Use errorbar width and linewidth fo emphasisgf_errorbar(conf.high + conf.low ~reorder(predictor, estimate),color =~ estimate,width =0.2,linewidth =~-log10(p.value)) %>%# All correlation estimates as pointsgf_point(estimate ~reorder(predictor, estimate), color ="black") %>%# Reference line at zero correlation scoregf_hline(yintercept =0, color ="grey", linewidth =2) %>%# Themes,Titles, and Scalesgf_labs(x =NULL, y ="Correlation with height in Galton", caption ="Significance = - log10(p.value)") %>%gf_refine(# Scale for colourscale_colour_distiller("Correlation", type ="div", palette ="RdBu"),# Scale for dumbbells!!scale_linewidth_continuous("significance",range =c(0.5,4))) %>%gf_refine(guides(linewidth =guide_legend(reverse =TRUE))) %>%gf_theme(theme_classic())```
We can clearly see the size of the correlations and the confidence intervals marked in this plot. father has somewhat greater correlation with children’s height, as compared to mother. nkids seems to matter very little. This kind of plot will be very useful when we pursue linear regression models.
Question
How can we show this correlation in a set of Scatter Plots + Regression Lines? Can we recreate Galton’s famous diagram?
```{r}#| layout-ncol: 2# For the sonsgf_point(height ~ father, data = Galton %>%filter(sex =="M"),title ="Soms and Fathers") %>%gf_smooth(method ="lm") %>%gf_theme(theme_minimal())gf_point(height ~ mother, data = Galton %>%filter(sex =="M"),title ="Sons and Mothers") %>%gf_smooth(method ="lm") %>%gf_theme(theme_minimal())# For the daughtersgf_point(height ~ father, data = Galton %>%filter(sex =="F"),title ="Daughters and Fathers") %>%gf_smooth(method ="lm") %>%gf_theme(theme_minimal())gf_point(height ~ mother, data = Galton %>%filter(sex =="F"),title ="Daughters and Mothers") %>%gf_smooth(method ="lm") %>%gf_theme(theme_minimal())```
An approximation to Galton’s famous plot1 (see Wikipedia):
```{r}#| layout-ncol: 2gf_point(height ~ (father + mother)/2, data = Galton) %>%gf_smooth(method ="lm") %>%gf_density_2d(n =8) %>%gf_abline(slope =1) %>%gf_theme(theme_minimal())gf_point(height ~ (father + mother)/2, data = Galton) %>%gf_smooth(method ="lm") %>%gf_ellipse(level =0.95, color ="red") %>%gf_ellipse(level =0.75, color ="blue") %>%gf_ellipse(level =0.5, color ="green") %>%gf_abline(slope =1) %>%gf_theme(theme_minimal())```
This is survey data collected by the US National Center for Health Statistics (NCHS) which has conducted a series of health and nutrition surveys since the early 1960’s. Since 1999 approximately 5,000 individuals of all ages are interviewed in their homes every year and complete the health examination component of the survey. The health examination is conducted in a mobile examination centre (MEC).
The dataset is described as: A data frame with 100000 observations on 76 variables. Some of these are:
- Race1 and Race2: factors with 5 and 6 levels respectively
- Education a factor with 5 levels
- HHIncomeMid Total annual gross income for the household in US dollars.
- Age
- BMI: Body mass index (weight/height2 in kg/m2)
- Height: Standing height in cm.
- Weight: Weight in kg > > - Testosterone: Testosterone total (ng/dL) - PhysActiveDays: Number of days in a typical week that participant does moderate or vigorous-intensity activity.
- CompHrsDay: Number of hours per day on average participant used a computer or gaming device over the past 30 days.
Missing Data
Why do so many of the variables have missing entries? What could be your guess about the Experiment/Survey`?
Let us make some counts of the data, since we have so many factors:
Does Testosterone have a relationship with parameters such as BMI, Weight, Height, PhysActiveDaysCompHrsDay and Age?
Does HHIncomeMid have a relationship with these same parameters? And with Gender?
Are there any other pairwise correlations that we should note? (This is especially useful in choosing independent variables for multiple regression)
( Yes we are concerned with men more than with the women, sadly.)
Correlations and Plots
```{r}#| message: false#| warning: false#| label: "Household Median Income"GGally::ggpairs(NHANES, # Choose the variables we want to plot forcolumns =c("HHIncomeMid", "Weight", "Height", "BMI", "Gender"), # LISTs of graphs needed at different locations# For different combinations of variables diag =list(continuous ="barDiag"),lower =list(continuous =wrap("smooth", alpha =0.01)),upper =list(continuous ="cor"),switch ="both", # axis labels in more traditional locationsprogress =FALSE ) +# No compute progress bars neededtheme_bw()```
We see that HHIncomeMid is Quantitative, discrete valued variable, since it is based on a set of median incomes for different ranges of income. BMI, Weight, Height are continuous Quant variables.
HHIncomeMid also seems to be relatively unaffected by Weight; And is only mildly correlated with Height and BMI, as seen both by the correlation score magnitudes and the slopes of the trend lines.
There is a difference in the median income by Gender, but we will defer that kind of test for later, when we do Statistical Inference.
Unsurprisingly, BMI and Weight have a strong relationship, as do Height and Weight; the latter is of course non-linear, since the Height levels off at a point.
It is clear that Testosterone has strong relationships with Height and Weight but not so much with BMI.
Visualizing Uncertainty in Correlation Estimates
Since the pairs plot is fairly clear for both target variables, let us head to visualizing the significance and uncertainty in the correlation estimates.
```{r}#| label: NHANES errorbars-1HHIncome_corrs <- NHANES %>%select(where(is.numeric)) %>%# leave off height to get all the remaining onesselect(- HHIncomeMid) %>%# perform a cor.test for all variables against height purrr::map(.x = .,.f = \(x) cor.test(x, NHANES$HHIncomeMid)) %>%# tidy up the cor.test outputs into a tidy data framemap_dfr(broom::tidy, .id ="predictor") HHIncome_corrsHHIncome_corrs %>%# Reference line at zero correlation scoregf_hline(yintercept =0, color ="grey", linewidth =2) %>%# arrange the predictors in order of their correlation scores# with the target variable (`height`)# Add errorbars to show uncertainty ranges / confidence intervals# Use errorbar width and linewidth fo emphasisgf_errorbar(conf.high + conf.low ~reorder(predictor, estimate),color =~ estimate,width =0.2,linewidth =~-log10(p.value +0.001)) %>%# All correlation estimates as pointsgf_point(estimate ~reorder(predictor, estimate), color ="black") %>%# Themes,Titles, and Scalesgf_labs(x =NULL, y ="Correlations with HouseHold Median Income", caption ="Significance = - log10(p.value)") %>%gf_theme(theme_classic()) %>%# Scale for colourgf_refine(guides(linewidth =guide_legend(reverse =TRUE)),scale_colour_distiller("Correlation", type ="div", palette ="RdBu"),# Scale for dumbbells!!scale_linewidth_continuous("Significance", range =c(0.05,2)),theme(axis.text.y =element_text(size =6, hjust =1)),coord_flip()) ```
If we select just the variables from our Research Question:
```{r}#| label: NHANES errorbars-2HHIncome_corrs_select <- NHANES %>%select(Height, Weight, BMI) %>%# Only change is here!# leave off height to get all the remaining ones#select(- HHIncomeMid) %>% # perform a cor.test for all variables against height purrr::map(.x = .,.f = \(x) cor.test(x, NHANES$HHIncomeMid)) %>%# tidy up the cor.test outputs into a tidy data framemap_dfr(broom::tidy, .id ="predictor") HHIncome_corrs_selectHHIncome_corrs_select %>%# arrange the predictors in order of their correlation scores# with the target variable (`height`)# Add errorbars to show uncertainty ranges / confidence intervals# Use errorbar width and linewidth fo emphasisgf_errorbar(conf.high + conf.low ~reorder(predictor, estimate),color =~ estimate,width =0.2,linewidth =~-log10(p.value +0.000001)) %>%# All correlation estimates as pointsgf_point(estimate ~reorder(predictor, estimate), color ="black") %>%# Reference line at zero correlation scoregf_hline(yintercept =0, color ="grey", linewidth =2) %>%# Themes,Titles, and Scalesgf_labs(x =NULL, y ="Correlations with HouseHold Median Income", caption ="Significance = - log10(p.value + 0.000001)") %>%gf_theme(theme_classic()) %>%# Scale for colourgf_refine(guides(linewidth =guide_legend(reverse =TRUE)),scale_colour_distiller("Correlation", type ="div", palette ="RdBu"),# Scale for dumbbells!!scale_linewidth_continuous("Significance", range =c(0.05,2)),theme(axis.text.y =element_text(size =8, hjust =1)),coord_flip()) ```
So we might say taller people make more money? And fatter people make slightly less money? Well, the magnitude of the correlations (aka effect size) are low so we would not imagine this to be a hypothesis that we can defend.
Let us look at the Testosterone variable: trying all variables shows some paucity of observations ( due to missing data), so we will stick with our chosen variables:
```{r}#| label: testosterone errorbarsTestosterone_corrs <- NHANES %>%select(Height, Weight, BMI) %>%# leave off height to get all the remaining ones#select(- Testosterone) %>%# perform a cor.test for all variables against height purrr::map(.x = .,.f = \(x) cor.test(x, NHANES$Testosterone)) %>%# tidy up the cor.test outputs into a tidy data framemap_dfr(broom::tidy, .id ="predictor")Testosterone_corrsTestosterone_corrs %>%# Reference line at zero correlation scoregf_hline(yintercept =0,color ="grey",linewidth =2) %>%# arrange the predictors in order of their correlation scores# with the target variable (`height`)# Add errorbars to show uncertainty ranges / confidence intervals# Use errorbar width and linewidth fo emphasisgf_errorbar( conf.high + conf.low ~reorder(predictor, estimate),color =~ estimate,width =0.2,linewidth =~-log10(p.value +0.000001) ) %>%# All correlation estimates as pointsgf_point(estimate ~reorder(predictor, estimate),color ="black") %>%# Themes,Titles, and Scalesgf_labs(x =NULL, y ="Correlations with Testosterone Levels",caption ="Significance = - log10(p.value + 0.000001)") %>%gf_theme(theme_classic()) %>%# Scale for colourgf_refine(guides(linewidth =guide_legend(reverse =TRUE)),scale_colour_distiller("Correlation", type ="div",palette ="RdBu"),# Scale for dumbbells!!scale_linewidth_continuous("Significance", range =c(0.05, 2)),theme(axis.text.y =element_text(size =8, hjust =1)),coord_flip() ) ```
Conclusion
We have a decent Correlations related workflow in R:
Load the dataset
inspect/skim/glimpse the dataset, identify Quant and Qual variables
Identify a target variable based on your knowledge of the data, how it was gathered, who gathered it and what was their intent
Schloerke, Barret, Di Cook, Joseph Larmarange, Francois Briatte, Moritz Marbach, Edwin Thoen, Amos Elberg, and Jason Crowley. 2023. GGally: Extension to “ggplot2”. https://CRAN.R-project.org/package=GGally.
@online{venkatadri2022,
author = {Venkatadri, Arvind},
title = {Tutorial on {Correlations} in {R}},
date = {2022-11-22},
url = {https://av-quarto.netlify.app/content/courses/Analytics/Descriptive/Modules/30-Correlations/files/correlations.html},
langid = {en},
abstract = {Tests, Tables, and Graphs for Correlations in R}
}